8 research outputs found

    Līdzsvarots mūsdienu latviešu valodas tekstu korpuss un tā tekstu atlases kritēriji

    The Balanced Corpus of Modern Latvian and the Text Selection Criteria
    Summary. The Balanced Corpus of Modern Latvian (~3.5 million running words) was recently created at the Institute of Mathematics and Computer Science (IMCS) (see http://www.korpuss.lv). The Corpus has been compiled from printed and electronic materials created after 1990. The Corpus is automatically morphologically tagged: for each token, all the syntactically valid interpretations are stored.
    Texts for the Corpus were chosen according to several text selection criteria, for instance time, media, and domain. This article discusses the text selection criteria chosen for this Corpus, problems related to Corpus design and text selection, the solutions found for these problems, and future plans for the Corpus.

    LaVA - Latvian Language Learner corpus

    Funding Information: The work reported in this paper is part of the project Development of Learner Corpus of Latvian: methods, tools and applications (Project No. lzp-2018/1-0527), which has been implemented at the Institute of Mathematics and Computer Science, University of Latvia (IMCS UL) since September 2018. The project is financed by the Latvian Council of Science. This work is also part of the Latvian State Research Programme Letonika - Fostering a Latvian and European Society project Research on Modern Latvian Language and Development of Language Technology (No. VPP-LETONIKA-2021/1-0006) and has received financial support from the Latvian Language Agency through grant agreement No. 4.6/2019-029.
    Publisher Copyright: © European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0.
    This paper presents the Latvian Language Learner Corpus (LaVA) developed at the Institute of Mathematics and Computer Science, University of Latvia. The LaVA corpus contains 1015 essays (190k tokens and 790k characters excluding whitespace) written by foreign students at Latvian higher education institutions who are learning Latvian as a foreign language in their first or second semester and reaching the A1 (possibly A2) Latvian language proficiency level. The corpus has morphological and error annotations. Error analysis and statistics of the LaVA corpus are also provided in the paper. The corpus is publicly available at: http://www.korpuss.lv/id/LaVA
    Publisher's version. Peer reviewed.

    Corpus Based Self-Assessment Platform for Latvian Language Learners

    Funding Information: This work is also a part of the National Research Programme Digital Resources of the Humanities project Digital Resources for Humanities: Integration and Development (No. VPP-IZM-DH-2020/1-0001) and has received financial support from the Latvian Language Agency through grant agreement No. 4.6/2019-029.
    Publisher Copyright: Copyright © 2022 the American Physiological Society.
    This paper presents a self-assessment platform for Latvian language learners at the Breakthrough (A1) and Waystage (A2) levels. The platform contains three types of exercises (typing, inflection and gap filling) based on error analysis of the Latvian Language Learner corpus (LaVA). All exercises are generated automatically from data in multiple corpora. The generated exercises are useful not only for learners outside the classroom, or even outside any formal education setting, but also for educators and authors of learning aids. Currently the platform is tailored for language learners at the beginner level, but it can be easily extended to more advanced levels. The platform is freely available online (http://uzdevumi.riks.korpuss.lv/en/) and its interface is available in two languages, Latvian and English.
    Publisher's version. Peer reviewed.

    A Prague Markup Language profile for the SemTi-Kamols grammar model

    Proceedings of the 18th Nordic Conference of Computational Linguistics NODALIDA 2011. Editors: Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa. NEALT Proceedings Series, Vol. 11 (2011), 303-306. © 2011 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/1695

    Mērķhipotēžu izvirzīšana latviešu valodas apguvēju korpusā [Formulating target hypotheses in the Latvian language learner corpus]

    Funding Information: This work has received financial support from the Latvian Council of Science under grant agreement No. lzp-2018/1-0527 (“Development of Learner Corpus of Latvian: methods, tools and applications”) in synergy with the Latvian State Research Programme “Latvian Language”, agreement No. VPP-IZM-2018/2-0002 (subproject “Acquisition of Latvian Language”).
    Keywords: corpus; learner corpus; target hypothesis; language acquisition; error annotation; corpus linguistics.
    Publisher Copyright: © 2020 University of Latvia. All rights reserved.
    A learner corpus is a systematically compiled, computerized database of texts produced by language learners (of both foreign and second languages). It provides a basis for studying the characteristics of foreign learners' language and for developing data-driven Latvian teaching materials and methodological aids, using corpus-driven error analysis.
    A learner corpus, like other language corpora, can be annotated at different language levels (morphologically, syntactically); however, error annotation and the error analysis based on it are especially important in research on learner language. Error analysis is influenced by two factors: 1) the chosen error types, or error typology; and 2) the target hypotheses put forward, i.e., the corrected text. It is therefore crucial to agree, before error annotation begins, on what will be annotated and how. The article begins with a brief description of the Latvian Language Learner Corpus (LaVA), which is under development, and then discusses the notion of the target hypothesis and its role in the creation of a learner corpus. The main principles for formulating target hypotheses in the LaVA corpus are presented, with concrete examples showing how learners' utterances are corrected according to the norms of Latvian and which main deviations from those norms are allowed.
    Publisher's version. Peer reviewed.

    Pasīva konstrukcijas lietuviešu un latviešu valodā [Passive constructions in Lithuanian and Latvian]

    The aim of this article is to analyze passive constructions in Latvian and Lithuanian. The Latvian correspondences of the Lithuanian past passive and present passive participles used as predicates (which together with the auxiliary constitute the passive forms in Lithuanian) were analyzed in the Lithuanian-Latvian-Lithuanian parallel corpus (LiLa). This research is limited to the analysis of the correspondences of Lithuanian passive constructions in the passive voice in Latvian. The analyzed data show that the compound tense forms of the passive in Latvian are the first correspondence of the Lithuanian passive constructions with the past passive participle as a predicate. These constructions are in the passive voice in both languages, and both have a resultative meaning (stative passive). The simple tense forms of the passive in Latvian are the second correspondence of the Lithuanian passive constructions with the present passive participle as a predicate. These constructions are also in the passive voice in both languages, and both have a process meaning (dynamic passive). The study shows that passive voice constructions are used more often in Lithuanian than in Latvian, as many correspondences of Lithuanian passive voice constructions were identified in the active voice in Latvian.

    Lithuanian present passive participle and past passive participle as predicates and correspondences in Latvian

    The objective of this study is to establish the Latvian correspondences of the Lithuanian past passive and present passive participles used as predicates. The study is based on the Lithuanian-Latvian subcorpus of the Lithuanian-Latvian-Lithuanian Parallel Corpus (LiLa), which covers different genres (fiction, journalism, documents, etc.) and contains ~3.5 million running words. The analysed data show that past passive and present passive participles are used as predicates more widely in Lithuanian than in Latvian. Lithuanian past passive and present passive participles as predicates have various correspondences in Latvian: 1) The passive voice with a past passive participle as predicate (ir/tika mests). The sentences are in the passive in both languages. Usually the corresponding Latvian predicate is in the same tense and mood as in Lithuanian. 2) The active voice with active verb forms as predicate (meta). The Lithuanian examples are in the passive, but the corresponding Latvian sentences are in the active voice. Not all the correspondences are regular. In some examples the corresponding Latvian predicate is in the debitive, a mood that does not exist in Lithuanian, where necessity is expressed by other means. 3) The present passive participle as predicate (ir metams). The present passive participle in Latvian usually expresses necessity (akmens ir metams) and possibility (celtne ir ieraugāma). Although the active voice is shown to be used more widely in Latvian than in Lithuanian, many active-voice examples in Latvian clearly have a passive meaning.

    Lithuanian-Latvian, Latvian-Lithuanian parallel corpus (LILA)

    The paper presents a new linguistic resource, LILA, the Lithuanian-Latvian-Lithuanian parallel corpus aligned at paragraph and sentence level. The total size of the LILA corpus is 9 million words. So far it is a unique resource for this language pair. The corpus contains metadata with bibliographical information (title, author, year of publication, etc.) and structural annotation, which includes the boundaries of aligned segments, paragraphs, and sentences. Paragraphs and sentences were aligned with the semi-automatic alignment tool Aligner 2.0.6.7. The corpus was compiled during 2011-2012 by scientists of the Vytautas Magnus University’s Centre of Computational Linguistics (VMU CCL) and the Latvian University’s Mathematical and Informatics Institute’s Laboratory of Artificial Intelligence (LU MII). The paper describes the problems and challenges that need to be solved when a parallel corpus for two small languages is created. The limited choice of appropriate parallel material is the most difficult obstacle, as it makes it hard to compile a corpus of the desired size. The paper presents the conception and structure of the LILA corpus, the phases of its compilation, the alignment tool, the query system, and examples of usage. The corpus is especially useful for teaching and learning languages, for comparing languages, for compiling dictionaries, and for developing language technology tools (e.g. statistical machine translation systems).